feat: add component contributor test harness #508

Open
ArangoGutierrez wants to merge 3 commits into NVIDIA:main from ArangoGutierrez:feature/component-test-harness

Conversation

@ArangoGutierrez
Contributor

Summary

Validate AICR components end-to-end with a single command — no GPU hardware required for most components.

make component-test COMPONENT=cert-manager
  • Three test tiers (auto-detected from registry.yaml): scheduling (KWOK redirect), deploy (Kind + bundle + health check), gpu-aware (Kind + nvml-mock + deploy + health check)
  • nvml-mock integration using ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters (arm64 + amd64, includes nvidia-smi)
  • Bundler bugfix: deploy.sh template now conditionally includes --version flag — fixes broken helm commands for components without defaultVersion in registry (e.g., gpu-operator)
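As a rough illustration of the three tiers listed above, the mapping from tier to steps could be sketched like this. The function name `steps_for_tier` and the step labels are hypothetical and only mirror the wording of this summary; the harness's real dispatch lives in its own scripts and Makefile targets.

```shell
# Hypothetical sketch: map a test tier to the steps named in the PR summary.
# Neither the function name nor the step labels come from the harness itself.
steps_for_tier() {
  case "$1" in
    scheduling) echo "redirect-to-kwok" ;;
    deploy)     echo "kind bundle health-check" ;;
    gpu-aware)  echo "kind nvml-mock deploy health-check" ;;
    *)          echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```

Calling `steps_for_tier gpu-aware` prints the gpu-aware step list; an unrecognized tier returns a non-zero status so a caller can stop early.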

New files

  • tools/component-test/ — 7 scripts (detect-tier, ensure-cluster, setup-gpu-mock, deploy-component, run-health-check, cleanup), Kind config, nvml-mock manifest, README
  • Makefile targets: component-test, component-detect, component-cluster, component-deploy, component-health, component-cleanup
  • Documentation updates in DEVELOPMENT.md and CONTRIBUTING.md

Test Plan

  • make test — all unit tests pass (72.1% coverage)
  • make component-test COMPONENT=cert-manager — deploy tier end-to-end (build → deploy → health check → cleanup)
  • make component-test COMPONENT=gpu-operator TIER=gpu-aware — gpu-aware tier end-to-end (build → nvml-mock → deploy → health check → cleanup)
  • make component-test COMPONENT=cert-manager TIER=scheduling — scheduling tier redirects to KWOK
  • New tests: TestGenerateDeployScript_EmptyVersionOmitsFlag, TestGenerateDeployScript_WithVersionIncludesFlag

@kannon92
Contributor

kannon92 commented Apr 8, 2026

So rather than go with mock GPUs is there a way we could have a CPU flavor?

I like that pattern for llama.cpp or vllm.

@ArangoGutierrez
Contributor Author

> So rather than go with mock GPUs is there a way we could have a CPU flavor?
>
> I like that pattern for llama.cpp or vllm.

Good question — the harness actually already has a GPU-free path. The deploy tier validates components in plain Kind without any GPU mock (cert-manager, kai-scheduler, etc. use this today).

The nvml-mock layer is specifically for components that gate on GPU presence during init — gpu-operator, nvidia-device-plugin, DRA driver — they won't even start their reconciliation loop unless they
detect NVML libraries and device nodes on the host. There's no CPU flavor of those because their entire purpose is managing GPU hardware.

For inference workloads like llama.cpp or vLLM, a CPU flavor would make sense as a complementary pattern — deploy the serving stack with a CPU backend and validate the end-to-end request path. That's a
higher-level integration test than what this harness targets (component deployment + health check), but it could be built on top of it.

So both patterns have a place:

  • nvml-mock: GPU infrastructure components that check for hardware at init
  • CPU flavors: inference/serving workloads that can run with CPU backends
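To make the "gate on GPU presence" idea concrete, here is a minimal sketch of the kind of probe such a component might run at init. It is an assumption for illustration only: the function name `gpu_stack_visible` is hypothetical, and real components such as gpu-operator use their own detection logic rather than this check. The file names `libnvidia-ml.so*` and `/dev/nvidia0` are the conventional NVML library and first GPU device node.

```shell
# Hypothetical probe: succeed only if both an NVML library and a GPU device
# node are visible under the given root directory (default: the real root).
gpu_stack_visible() {
  local root="${1:-}"
  local lib
  local found=1
  # Unmatched globs stay literal, so the -e test simply fails for them.
  for lib in "$root"/usr/lib*/libnvidia-ml.so* "$root"/usr/lib/*/libnvidia-ml.so*; do
    if [ -e "$lib" ]; then
      found=0
      break
    fi
  done
  [ "$found" -eq 0 ] && [ -e "$root/dev/nvidia0" ]
}
```

On a plain Kind node both checks would fail, which is the gap a mock layer like nvml-mock is meant to fill for the gpu-aware tier.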

@ArangoGutierrez force-pushed the feature/component-test-harness branch from d84bc0a to 45ddbbe on April 8, 2026 19:08
Contributor

@kannon92 left a comment

Thanks for this! This should help me a lot with Kueue work.

@ArangoGutierrez
Contributor Author

CI is passing, ready for review @yuanchen8911 / @mchmarny

@yuanchen8911
Contributor

yuanchen8911 commented Apr 8, 2026

Cross-Review Summary for PR #508

Reviewers: Claude Code, Codex, CodeRabbit + Integration Analysis
Rounds: 1 + Codex follow-up
Consensus reached: Yes

Confirmed Issues

| # | Severity | Finding | Confirmed By |
|---|----------|---------|--------------|
| 1 | Low | cleanup.sh interactive read prompt blocks without a TTY. When DELETE_CLUSTER=true and FORCE_CLEANUP is unset, cleanup.sh#L95-L109 calls read -r -p, which hangs in non-interactive environments. Not on the main component-test happy path, but the README documents make component-cleanup DELETE_CLUSTER=true as a supported command. | Codex + CodeRabbit |
| 2 | Low | Scheduling tier silently succeeds without testing. make component-test COMPONENT=<scheduling-component> exits 0 via Makefile#L602-L619 and ensure-cluster.sh#L46-L55 after printing guidance. Conflicts with the README's promise that the harness "auto-detects the right test tier, creates a Kind cluster, deploys the component, and runs its health check." | Codex + CodeRabbit |
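
One common way to avoid the hang described in finding 1 is to check whether stdin is a terminal before prompting. The sketch below is an assumption about how such a guard might look, not the code in cleanup.sh; the function name `confirm_delete` and the message text are hypothetical, while `FORCE_CLEANUP` is the variable named in the finding.

```shell
# Hypothetical confirm helper: never call `read` unless stdin is a terminal.
confirm_delete() {
  # An explicit opt-in skips the prompt entirely.
  if [ "${FORCE_CLEANUP:-false}" = "true" ]; then
    return 0
  fi
  # `[ -t 0 ]` is true only when file descriptor 0 (stdin) is a TTY.
  if [ ! -t 0 ]; then
    echo "error: no TTY; set FORCE_CLEANUP=true to delete without a prompt" >&2
    return 1
  fi
  printf 'Delete cluster? [y/N] '
  read -r reply
  [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}
```

In CI, where stdin is usually not a terminal, this fails fast with a message instead of waiting for input that never arrives.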

Cross-review by Claude Code + Codex + CodeRabbit

Contributor

@yuanchen8911 left a comment

Left some comments.

@yuanchen8911 yuanchen8911 self-requested a review April 9, 2026 00:29
Contributor

@yuanchen8911 left a comment

There are two issues. Both are low severity, but they affect correctness of the new contributor workflow, so I would suggest changes.

@ArangoGutierrez force-pushed the feature/component-test-harness branch from 2844c50 to 481035d on April 9, 2026 06:20
Contributor

@yuanchen8911 left a comment

LGTM — both review issues are addressed in 481035d. Needs a rebase onto main and re-approval from @kannon92

Validate AICR components end-to-end with a single command:

  make component-test COMPONENT=cert-manager

Three test tiers, auto-detected from registry.yaml:
- scheduling: redirects to existing KWOK infrastructure
- deploy: Kind cluster + aicr bundle + chainsaw health check
- gpu-aware: Kind + nvml-mock DaemonSet + deploy + health check

New files:
- tools/component-test/{detect-tier,ensure-cluster,setup-gpu-mock,
  deploy-component,run-health-check,cleanup}.sh
- tools/component-test/{kind-config.yaml,manifests/nvml-mock.yaml,README.md}

Makefile targets: component-test, component-detect, component-cluster,
component-deploy, component-health, component-cleanup.

Uses ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters
(arm64+amd64, includes nvidia-smi).

Tested end-to-end:
- deploy tier: cert-manager (build → deploy → health check → cleanup)
- gpu-aware tier: gpu-operator (build → nvml-mock → deploy → health check → cleanup)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The deploy.sh template unconditionally included '--version {{ .Version }}'
which produced a broken helm command when Version was empty (e.g.,
gpu-operator has no defaultVersion in registry.yaml). Helm 4 treats
the empty --version as a missing required argument.

The template now conditionally includes --version only when Version
is non-empty, allowing components without pinned versions to install
the latest chart from the repository.
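
The effect of the conditional flag can be sketched in shell as follows. This is an illustration under assumptions, not the generated deploy.sh: the helper name `helm_install_cmd` is hypothetical, and the use of `helm upgrade --install` is assumed, since the commit message only says "helm command". The helper prints the command line instead of running it.

```shell
# Hypothetical helper: print a helm command line, adding --version only when
# a chart version is set. Arguments: release, chart, version (optional).
helm_install_cmd() {
  local release="$1"
  local chart="$2"
  local version="${3:-}"
  # Rebuild the positional parameters as the command being assembled.
  set -- helm upgrade --install "$release" "$chart"
  if [ -n "$version" ]; then
    set -- "$@" --version "$version"
  fi
  printf '%s\n' "$*"
}
```

With an empty version the flag is omitted entirely, so helm never sees a bare `--version` with no value. The chart names and version passed in are placeholders chosen by the caller.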

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

- cleanup.sh: Detect non-interactive mode (no TTY) and fail with a
  clear error instead of hanging on 'read' when DELETE_CLUSTER=true
  without FORCE_CLEANUP=true.

- Makefile: Scheduling tier now exits with code 2 instead of 0 to
  signal that no test was executed, with guidance to use make kwok-e2e.

- README: Clarify that scheduling tier redirects to KWOK and does not
  create a Kind cluster.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
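
A caller that wants to tell "no test was executed" apart from a real failure could interpret the exit code along these lines. The function name `classify_exit` and the labels are hypothetical; the meaning of code 2 comes from the commit message above.

```shell
# Hypothetical wrapper logic: interpret a harness exit code.
# 0 = test passed, 2 = scheduling tier (no test executed), other = failure.
classify_exit() {
  case "$1" in
    0) echo "passed" ;;
    2) echo "skipped" ;;
    *) echo "failed" ;;
  esac
}
```

One caveat worth checking before relying on this: GNU make itself exits with status 2 whenever any recipe fails, so the distinction is only dependable when the code is read from the underlying script rather than from `make`.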
@ArangoGutierrez force-pushed the feature/component-test-harness branch from 481035d to 4584c25 on April 9, 2026 17:23
@ArangoGutierrez requested a review from kannon92 on April 9, 2026 17:31
@ArangoGutierrez
Contributor Author

PTAL @kannon92 / @mchmarny

@kannon92
Contributor

kannon92 commented Apr 9, 2026

I'm not an approver here, but when I last looked, the PR looked good to me.
